User Guide

This user guide covers the basics of using our library, as well as the details of the different algorithms it includes. Our library provides two major programming interfaces:

  1. An API to manage data in a distributed way, built around the concept of distributed arrays (ds-arrays).

  2. An estimator-based interface, inspired by scikit-learn, to work with different machine learning models.

The term estimator-based interface means that all the machine learning models in our library are implemented as estimator objects. Our library's estimators implement the same API as scikit-learn, which is mainly based on the fit and predict methods. The typical workflow in our library consists of the following steps:

  1. Reading input data into a ds-array
  2. Creating an estimator object
  3. Fitting the estimator with the input data
  4. Getting information from the fitted estimator or applying the model to new data

An example of performing K-means clustering with our library is as follows:

import our_library as ds
from our_library.cluster import KMeans

# load data into a ds-array
x = ds.load_txt_file("/path/to/file", block_size=(100, 100))

# create estimator object
kmeans = KMeans(n_clusters=10)

# fit estimator
kmeans.fit(x)

# get information from the model
cluster_centers = kmeans.centers

Although the code above looks completely sequential, all our library algorithms and operations are parallelized using PyCOMPSs.

How to Run our Library


Two things are worth noting about running our library:

  1. It can be installed and used as a regular Python library, but it makes use of PyCOMPSs directives internally to parallelize the computation.
  2. Applications using our library need to be executed with PyCOMPSs. This can be done with the runcompss or enqueue_compss commands:

runcompss my_dislib_application.py
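On clusters with a job scheduler, applications are typically submitted with enqueue_compss instead. A minimal sketch, assuming a standard PyCOMPSs installation (the node count and wall-clock limit shown are placeholders to adapt to your platform):

enqueue_compss --num_nodes=2 --exec_time=30 my_dislib_application.py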

For more information on how to start running our library applications, refer to the QuickStart guide.

Distributed Arrays


Distributed arrays (ds-arrays) are the main data structure used in our library. A ds-array is a matrix divided into blocks that are stored remotely. Each block of a ds-array is a NumPy array. Our library provides an API similar to NumPy to work with ds-arrays in a completely sequential way.

There are some factors to consider when working with ds-arrays:

  1. All operations on ds-arrays are parallelized with PyCOMPSs.
    • The degree of parallelization is controlled by the array's block size.
  2. The block size defines the number of rows and columns of each block in a ds-array.
  3. Choosing the right block size is essential to exploit our library's full potential (see the sketch after this list).
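
As a quick illustration of how the block size controls partitioning, the sketch below creates the same 100x100 ds-array with two different block sizes using the random_array routine described later in this guide (assuming ds-arrays expose a NumPy-like shape attribute):

import our_library as ds

# 20x20 blocks: the 100x100 array is stored as a 5x5 grid of NumPy blocks
x_fine = ds.random_array(shape=(100, 100), block_size=(20, 20))

# 50x50 blocks: the same array is stored as a 2x2 grid of larger blocks
x_coarse = ds.random_array(shape=(100, 100), block_size=(50, 50))

# both arrays hold the same data; only the partitioning differs
print(x_fine.shape, x_coarse.shape)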

Choosing the Right Block Size


The ideal block size depends on the available resources and on the application itself. The number of tasks generated by an application using our library is inversely proportional to the block size. This has two consequences:

  1. Small blocks allow higher parallelism, as the computation is divided into more tasks.
  2. A large number of blocks also produces overhead that can have a negative impact on performance.

Besides parallelism, the block size also determines the amount of data that needs to fit in memory at once. Most estimators in our library process ds-arrays in blocks of rows. This means that the optimal block size when using these estimators typically produces as many horizontal blocks as available processors. The following examples show how the K-means estimator would process an 8x8 ds-array split with different block sizes.

  • Using 4x4 blocks generates 2 tasks
  • Using 2x8 blocks generates 4 tasks

Using 2x4 blocks provides the same parallelism as 2x8 blocks, but with the overhead of dealing with four additional blocks.
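
Since row-wise estimators create roughly one task per horizontal block, the expected task count can be estimated directly from the block size. A minimal sketch in plain Python (independent of our library):

import math

def n_row_tasks(n_rows, block_rows):
    # one task per horizontal block of rows
    return math.ceil(n_rows / block_rows)

print(n_row_tasks(8, 4))  # 4x4 blocks on an 8x8 array -> 2 tasks
print(n_row_tasks(8, 2))  # 2x8 (or 2x4) blocks on an 8x8 array -> 4 tasks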

Another factor to take into consideration when choosing a block size is task granularity. As mentioned before, the number of tasks created by our library is proportional to the number of blocks, and block size is directly proportional to task duration, or granularity. This matters because scheduling a task in a distributed environment requires communicating with a remote computer and transferring some data, which has a significant cost. Therefore, long tasks (big blocks) are normally more efficient than short tasks (small blocks).

Summary

To summarize, there is a trade-off between the amount of parallelism, scheduling overhead, and memory usage that highly depends on your platform. Nevertheless, here are some general guidelines for choosing your block size:

  1. Ensure that a block of rows fits in the memory of a single processor.
  2. Define NxN blocks, where N is the number of processors you want to use.
  3. For small ds-arrays, it is better to use N < (number of processors) and increase granularity at the cost of reducing parallelism (see the sketch after this list).
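
These guidelines can be turned into a rough starting point in code. The helper below is hypothetical (it is not part of our library's API); it follows guideline 2 by aiming for an NxN grid of blocks, and leaves the memory check of guideline 1 to the caller:

import math

def suggest_block_size(shape, n_processors):
    # hypothetical helper: aim for an NxN grid of blocks, where N is
    # the number of processors (guideline 2); checking that a block of
    # rows fits in memory (guideline 1) is left to the caller
    n_rows, n_cols = shape
    block_rows = math.ceil(n_rows / n_processors)
    block_cols = math.ceil(n_cols / n_processors)
    return (block_rows, block_cols)

print(suggest_block_size((100, 100), 4))  # (25, 25) -> a 4x4 grid of blocks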

Creating Arrays


Our library can create ds-arrays in two ways:

1. Starting from scratch:

  • The API reference contains the full list of available routines, such as random_array, which creates a ds-array with random data:
import our_library as ds

x = ds.random_array(shape=(100, 100), block_size=(20, 20))

2. Using existing data:

  • By reading data from a file. In this case, our library supports common data formats such as CSV and SVMLight (see the sketch after this list).
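
For example, a CSV-like text file can be read with the load_txt_file routine shown earlier (a minimal sketch; the exact routines and parameters for each format, such as SVMLight, are listed in the API reference):

import our_library as ds

# read a text file into a ds-array partitioned in 100x100 blocks
x = ds.load_txt_file("/path/to/data.csv", block_size=(100, 100))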

Slicing


Similar to NumPy arrays, ds-arrays provide different types of slicing. The result of a slicing operation is a new ds-array with a subset of the elements of the original ds-array.

Currently, these are the supported slicing methods:

x[i]
returns the ith row of x.

x[i,j]
returns the element at the (i,j) position.

x[i:j]
returns a set of rows (from i to j), where i and j are optional.

x[:, i:j]
returns a set of columns (from i to j), where i and j are optional.

x[[i,j,k]]
returns a set of non-consecutive rows.

x[:, [i,j,k]]
returns a set of non-consecutive columns.

x[i:j, k:m]
returns a set of elements, where i, j, k, and m are optional.
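
The sketch below combines several of these slicing methods, assuming the random_array routine described above. Each result is itself a ds-array, so subsequent operations on it remain parallelized:

import our_library as ds

x = ds.random_array(shape=(100, 100), block_size=(20, 20))

first_row = x[0]        # first row of x
element = x[3, 5]       # element at position (3, 5)
rows = x[10:20]         # rows 10 to 19
cols = x[:, 0:5]        # columns 0 to 4
picked = x[[1, 7, 42]]  # non-consecutive rows 1, 7, and 42
region = x[10:20, 0:5]  # elements in rows 10-19 and columns 0-4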

Classification


The classification module of our library includes estimators that can be used to predict the classes of unlabeled data. Each estimator implements the fit method to build the model and the predict method to classify new data.

The fit method takes two ds-arrays as input:

  1. A ds-array x of shape [n_samples, n_features] holding the training samples.
  2. A ds-array y of integer values, of shape [n_samples], holding the class labels of the training samples.

The predict method takes a single ds-array with the samples to be classified. These ds-arrays can be loaded using one of the our-library.data methods.
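
Putting this together, a typical classification workflow looks like the sketch below. The estimator name RandomForestClassifier and the file paths are illustrative assumptions; check the API reference for the classifiers that are actually available:

import our_library as ds
from our_library.classification import RandomForestClassifier  # assumed estimator name

# training samples and their integer class labels
x = ds.load_txt_file("/path/to/train.csv", block_size=(100, 10))
y = ds.load_txt_file("/path/to/labels.csv", block_size=(100, 1))

# fit the model and classify new samples
clf = RandomForestClassifier()
clf.fit(x, y)

x_new = ds.load_txt_file("/path/to/test.csv", block_size=(100, 10))
predictions = clf.predict(x_new)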

Comparison of classification methods: